
feat: unify config-driven retry across VLM and embedding#1049

Merged
MaojiaSheng merged 13 commits into volcengine:main from snemesh:feature/unified-retry
Mar 31, 2026

Conversation

@snemesh
Contributor

@snemesh snemesh commented Mar 28, 2026

Summary

Closes #922

Unifies retry behavior across all VLM and embedding paths with a shared transient retry module, replacing scattered per-backend implementations.

Changes

  • New module — openviking/models/retry.py with is_transient_error(), transient_retry(), transient_retry_async(); error classification (28 error types), exponential backoff with jitter, count-based retry
  • VLM — Remove function-level max_retries param from get_completion_async(). Retry is now config-driven via vlm.max_retries (default 3, was 2). Add retry to get_vision_completion_async() (was zero retry) and sync methods
  • Embedding — Add embedding.max_retries config (default 3). Apply unified retry to all 8 providers (OpenAI, Volcengine, VikingDB, Gemini, MiniMax, Jina, Voyage, LiteLLM)
  • SDK retry disabled everywhere to prevent double-retry explosion (SDK retry × our retry)
  • Kwargs migration — Switch VLM call chains from positional to keyword arguments (fixes pre-existing tool_choice/messages positional bug in vlm_config.py)
  • Backward compatible — exponential_backoff_retry() unchanged; max_retries=0 disables retry
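The shape of the shared module can be sketched as follows. This is a minimal illustration, not the actual openviking/models/retry.py: the function names and the max_retries=0 semantics come from the PR description, while the backoff constants (1 s base, 30 s cap) and the status-code check are assumptions.

```python
import asyncio
import random

# Assumed set of retryable HTTP status codes (matches the PR's list).
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def is_transient_error(exc: Exception) -> bool:
    """Classify an error as retryable; unknown errors default to permanent."""
    status = getattr(exc, "status_code", None)
    if status is not None:
        return status in TRANSIENT_STATUS
    return isinstance(exc, (ConnectionError, TimeoutError))

async def transient_retry_async(call, max_retries: int = 3, base: float = 1.0):
    """Retry `call` on transient errors with exponential backoff plus jitter.

    max_retries=0 disables retry (the call runs exactly once).
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception as exc:
            if attempt == max_retries or not is_transient_error(exc):
                raise
            # Exponential backoff, capped, with 50-100% jitter (assumed shape).
            delay = min(base * 2 ** attempt, 30.0) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```

A sync transient_retry() would follow the same loop with time.sleep() instead of await.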

Config

vlm:
  max_retries: 3    # default, was 2

embedding:
  max_retries: 3    # new field
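The config contract above can be read as: each section carries its own max_retries, falling back to the new default of 3. A hypothetical resolver (resolve_max_retries and DEFAULTS are illustrative names, not project code) showing that contract:

```python
# Defaults per the PR: vlm.max_retries 3 (was 2), embedding.max_retries 3 (new).
DEFAULTS = {"vlm": {"max_retries": 3}, "embedding": {"max_retries": 3}}

def resolve_max_retries(config: dict, section: str) -> int:
    """Read max_retries for a config section; max_retries=0 disables retry."""
    return config.get(section, {}).get(
        "max_retries", DEFAULTS[section]["max_retries"])
```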

Breaking Changes

  • max_retries parameter removed from VLMBase.get_completion_async(), StructuredVLM.complete_json_async(), StructuredVLM.complete_model_async(), VLMConfig.get_completion_async()
  • Retry is now fully config-driven

Error Classification

Retryable: HTTP 429/500/502/503/504, TooManyRequests, RateLimit, RequestBurstTooFast, ConnectionError, TimeoutError, openai.RateLimitError

Not retryable: HTTP 400/401/403/404/422, InvalidRequestError, AuthenticationError, unknown errors (conservative default)
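For SDK errors that do not expose a status code, classification by exception name or message is one plausible mechanism. The pattern list below mirrors the retryable names from the PR; the regex approach and the looks_transient helper are assumptions, not the module's actual implementation.

```python
import re

# Name/message patterns treated as transient (list taken from the PR text).
TRANSIENT_PATTERN = re.compile(
    r"TooManyRequests|RateLimit|RequestBurstTooFast|\b(429|500|502|503|504)\b")

# Stand-ins for provider SDK exception classes.
class RateLimitError(Exception): pass
class AuthenticationError(Exception): pass

def looks_transient(exc: Exception) -> bool:
    # Unknown errors fall through to False — the conservative default.
    return bool(TRANSIENT_PATTERN.search(f"{type(exc).__name__}: {exc}"))
```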

Test Plan

  • 84 new tests across 5 files
  • Error classification: 28 parametrized cases (transient vs permanent)
  • Retry behavior: sync + async, backoff verification, edge cases
  • VLM backend integration: retry on 429, no retry on 401, vision retry, SDK disabled
  • Embedding provider integration: OpenAI + VikingDB retry verification
  • Config flow: end-to-end from ov.conf to backend instance
  • Backward compatibility: exponential_backoff_retry unchanged


@qin-ctx qin-ctx self-assigned this Mar 28, 2026
@snemesh snemesh force-pushed the feature/unified-retry branch from e73415d to b59fd97 on March 28, 2026 10:32
@MaojiaSheng
Collaborator

there are some conflicts that need to be resolved, thanks

snemesh added 11 commits March 30, 2026 17:46
Implements `openviking/models/retry.py` with `is_transient_error`,
`transient_retry`, and `transient_retry_async` — a single config-driven
retry layer replacing scattered per-backend implementations.
Adds 50 unit tests covering classification, backoff, jitter, exhaustion,
and custom predicates.
- VLMBase: change max_retries default 2→3, remove max_retries param from
  get_completion_async abstract signature
- OpenAI backend: wrap all 4 methods with transient_retry/transient_retry_async,
  disable SDK retry (max_retries=0 in client constructors), remove manual
  for-loop retry
- VolcEngine backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
- LiteLLM backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
  (volcengine#922)

- VLMConfig: default max_retries 2→3, remove max_retries from
  get_completion_async signature, switch all wrappers to kwargs
- StructuredVLM (llm.py): remove max_retries from complete_json_async and
  complete_model_async, switch all internal calls to kwargs
- memory_react.py: remove max_retries=self.vlm.max_retries (now handled
  internally by backend)
- Update test stubs to match new signatures (remove max_retries=0)
Tests cover OpenAI backend as representative:
- Completion retries on 429, does NOT retry on 401
- Vision completion now retries (was zero before)
- Config max_retries is used (default=3)
- max_retries removed from get_completion_async signature (all backends)
- OpenAI SDK retry disabled (max_retries=0 in client constructors)
- EmbeddingConfig: new max_retries field (default=3) for retry configuration
- EmbeddingConfig._create_embedder(): injects max_retries into params["config"]
- EmbedderBase.__init__(): extracts max_retries from the config dict
- OpenAI: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Volcengine: replace exponential_backoff_retry with transient_retry, drop is_429_error
- VikingDB: add transient_retry (previously had no retry at all)
- Gemini: disable SDK HttpRetryOptions (attempts=1), wrap embed/embed_batch
- MiniMax: disable urllib3 Retry (total=0), wrap embed/embed_batch
- Jina: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Voyage: disable SDK retry (max_retries=0), wrap embed/embed_batch
- LiteLLM: wrap litellm.embedding() calls

All providers now use the unified transient_retry with is_transient_error for error classification. The wrapper sits INSIDE each method, around the raw API call, BEFORE the try/except that converts errors to RuntimeError.
- test_embedding_retry_integration: OpenAI and VikingDB retry on transient/permanent errors
- test_retry_config: VLMConfig and EmbeddingConfig max_retries fields and defaults
- test_backward_compat: exponential_backoff_retry importable, signature unchanged, time-based
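The wrapper placement described in these commits — inside the method, around the raw API call, before the RuntimeError-converting try/except — can be sketched like this. FakeClient, Embedder, and the TimeoutError-only predicate are illustrative stand-ins, not the real embedder code.

```python
def transient_retry(call, max_retries=3):
    # Minimal count-based retry; TimeoutError stands in for is_transient_error.
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_retries:
                raise

class FakeClient:
    """Fails `failures` times with a transient error, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0
    def create(self, text):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("transient")
        return [0.1, 0.2]

class Embedder:
    def __init__(self, client, max_retries=3):
        self.client = client
        self.max_retries = max_retries
    def embed(self, text):
        try:
            # Retry wraps only the raw API call...
            return transient_retry(lambda: self.client.create(text),
                                   max_retries=self.max_retries)
        except Exception as exc:
            # ...so exhausted/permanent errors are still normalized here.
            raise RuntimeError(f"embedding failed: {exc}") from exc
```

Placing the retry inside the error-normalizing try/except would not work: once an exception becomes RuntimeError, the transient/permanent distinction is lost.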
After upstream refactored sync methods to use run_async(),
only transient_retry_async is needed in volcengine_vlm.py.
@snemesh snemesh force-pushed the feature/unified-retry branch from 6f9439b to 03ee476 on March 30, 2026 16:02
@snemesh
Contributor Author

snemesh commented Mar 30, 2026

Conflicts resolved

Rebased onto current main. Here's a summary of the conflict resolution, particularly around volcengine_vlm.py which had the most significant changes.

Background

Two upstream commits landed after this PR was branched:

  • ce99887 — License change Apache-2.0 → AGPL-3.0 (touched all file headers)
  • a66a1a6 — Refactor memory extract v2 (#1045) + VolcEngine Responses API with prefix caching

The license conflicts were trivial (accept upstream). The interesting part is how retry integrates with the new VolcEngine architecture.

volcengine_vlm.py — Retry + Responses API

Before (old architecture): VolcEngine used client.chat.completions.create() directly. Our retry wrapped that call:

async def _call():
    return await client.chat.completions.create(**kwargs)
response = await transient_retry_async(_call, max_retries=self.max_retries)

After (new architecture): Upstream refactored to use VolcEngine Responses API with prefix caching via responseapi_prefixcache_completion(). Sync methods now delegate through run_async().

Resolution: Retry wraps the high-level responseapi_prefixcache_completion call:

async def _call():
    return await self.responseapi_prefixcache_completion(
        static_segments=static_segments,
        dynamic_messages=dynamic_messages,
        response_format=response_format,
        tools=tools,
        tool_choice=tool_choice,
    )
response = await transient_retry_async(_call, max_retries=self.max_retries)

Why this is safe:

  • Retry is placed at the same abstraction level — around the API call, before response processing
  • Prefix cache is deterministic (same input → same cache key), so retries won't cause cache inconsistency
  • Sync methods (get_completion, get_vision_completion) delegate to async via run_async(), so they inherit retry automatically
  • get_vision_completion_async delegates to get_completion_async, so vision also gets retry
  • Only transient_retry_async is imported (removed unused transient_retry since sync paths use run_async)
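The sync-inherits-retry point can be sketched as follows. This is a simplified stand-in: run_async here is plain asyncio.run rather than the project's helper, and the retry loop is inlined instead of calling transient_retry_async.

```python
import asyncio

def run_async(coro):
    # Simplified stand-in for the project's run_async helper.
    return asyncio.run(coro)

class VLM:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = 0

    async def _call(self, prompt):
        # Simulated backend: fails once, then succeeds.
        if self.attempts < 2:
            raise TimeoutError("transient")
        return f"answer:{prompt}"

    async def get_completion_async(self, prompt):
        # Config-driven retry lives only in the async path.
        for attempt in range(self.max_retries + 1):
            self.attempts += 1
            try:
                return await self._call(prompt)
            except TimeoutError:
                if attempt == self.max_retries:
                    raise

    def get_completion(self, prompt):
        # Sync path delegates to async, inheriting retry automatically.
        return run_async(self.get_completion_async(prompt))
```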

memory_react.py — Accepted deletion

Upstream deleted this file in #1045. Our branch had modified the get_completion_async call there to pass max_retries explicitly. Since retry is now config-driven and max_retries was removed from the signature, that change is no longer needed. Accepted the deletion.

Other files

  • openai_vlm.py, llm.py, vlm_config.py — straightforward: accepted upstream formatting/license, applied our kwargs migration and retry on top
  • openai_embedders.py — kept both imports (transient_retry + DEFAULT_AZURE_API_VERSION)

Test results

All 84 tests from this PR pass. No regressions introduced (the 71 pre-existing failures on main are unchanged).

snemesh added 2 commits March 30, 2026 18:16
- Rename unused loop vars i, item → _i, _item (B007)
- Suppress unused has_images assignment (F841) — upstream code, kept for clarity
test_fs_tree intermittently returns 500 on windows-latest when
HAS_SECRETS=false — the resource processing pipeline retries
failed embeddings (401 auth errors) in the background, causing
server load that affects the fs/tree endpoint.
@MaojiaSheng MaojiaSheng merged commit a092d64 into volcengine:main Mar 31, 2026
11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 31, 2026
@qin-ctx
Collaborator

qin-ctx commented Mar 31, 2026

For context, we have also been working on a very similar implementation in parallel in #926.

Functionally, the direction is the same: move retry down to the model layer, make it config-driven, and apply it consistently across VLM and embedding instead of adding more call-site wrappers.

The main differences vs our in-flight branch are:

  • #1049 lands the broad VLM + embedding unification and the config contract directly in one PR;
  • our branch also centralizes transient-error classification in a shared utility and reuses the same classification in the circuit-breaker path, so retry and breaker behavior stay aligned;
  • our current test coverage emphasis is a bit different, with some extra config/wiring and integration-path verification around the surrounding call flow.

Our follow-up plan on our side is to converge on a single bottom-layer retry implementation rather than keep two parallel versions, keep the config-driven contract as the baseline, and only carry forward the pieces that still add value on top of #1049 (for example shared classifier reuse and any targeted behavior/test gaps we find).

